{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://europe-west1-atp-views-tracker.cloudfunctions.net/working-analytics?notebook=tutorials--agent-security-with-llamafirewall--output-guardrail)\n",
    "\n",
    "# Guardrails For Agents: Output Validation\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Have you ever wanted to make your AI agents more secure? In this tutorial, we will build output validation guardrails using LlamaFirewall to protect your agents from harmful or misaligned responses.\n",
    "\n",
    "Misalignment occurs when an AI agent's responses deviate from its intended purpose or instructions. For example, if you have an agent designed to help with customer service, but it starts giving financial advice or making inappropriate jokes, that would be considered misaligned behavior. Misalignment can range from minor deviations to potentially harmful outputs that could damage your business or users.\n",
    "\n",
    "**What you'll learn:**\n",
    "- What guardrails are and why they're essential for agent security \n",
    "- How to implement output validation using LlamaFirewall\n",
    "\n",
    "Let's understand the basic architecture of output validation:\n",
    "\n",
    "![Output Guardrail](assets/output-guardrail.png)\n",
    "\n",
    "Here you can see that the `LlamaFirewall` receives the LLM's response, as well as the original user input. Both parameters are used for the alignment check.\n",
    "\n",
    "## About Guardrails\n",
    "\n",
    "Guardrails run in parallel to your agents, enabling you to do checks and validations of user input. For example, imagine you have an agent that uses a very smart (and hence slow/expensive) model to help with customer requests. You wouldn't want malicious users to ask the model to help them with their math homework. So, you can run a guardrail with a fast/cheap model. If the guardrail detects malicious usage, it can immediately raise an error, which stops the expensive model from running and saves you time/money.\n",
    "\n",
    "There are two kinds of guardrails:\n",
    "1. Input guardrails run on the initial user input\n",
    "2. Output guardrails run on the final agent output\n",
    "\n",
    "*This section is adapted from [OpenAI Agents SDK Documentation](https://openai.github.io/openai-agents-python/guardrails/)*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementation Process\n",
    " \n",
    "Make sure the `.env` file contains the `TOGETHER_API_KEY` and `OPENAI_API_KEY`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "TOGETHER_API_KEY is set\n",
      "OPENAI_API_KEY is set\n"
     ]
    }
   ],
   "source": [
    "from dotenv import load_dotenv\n",
    "import os\n",
    "\n",
    "load_dotenv()  # This will look for .env in the current directory\n",
    "\n",
    "# Check if TOGETHER_API_KEY is set (needed for AlignmentCheckScanner)\n",
    "if not os.environ.get(\"TOGETHER_API_KEY\"):\n",
    "    print(\n",
    "        \"TOGETHER_API_KEY environment variable is not set. Please set it before running this demo.\"\n",
    "    )\n",
    "    exit(1)\n",
    "else:   \n",
    "    print (\"TOGETHER_API_KEY is set\")\n",
    "\n",
    "# Check if OPENAI_API_KEY is set (needed for agent)\n",
    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
    "    print(\n",
    "        \"OPENAI_API_KEY environment variable is not set. Please set it before running this demo.\"\n",
    "    )\n",
    "    exit(1)\n",
    "else:\n",
    "    print (\"OPENAI_API_KEY is set\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, We need to enable nested async support. This allows us to run async code within sync code blocks, which is needed for some LlamaFirewall operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "# Apply nest_asyncio to allow nested event loops\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Initialize LlamaFirewall, `define ScannerType.AGENT_ALIGNMENT`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from llamafirewall import (\n",
    "    LlamaFirewall,\n",
    "    Trace,\n",
    "    Role,\n",
    "    ScanDecision,\n",
    "    ScannerType,\n",
    "    UserMessage,\n",
    "    AssistantMessage\n",
    ")\n",
    "\n",
    "# Initialize LlamaFirewall with both Prompt Guard and Alignment Checker scanners\n",
    "lf = LlamaFirewall(\n",
    "    scanners={\n",
    "        Role.ASSISTANT: [ScannerType.AGENT_ALIGNMENT],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll define `LlamaFirewallOutput` for convenience"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pydantic import BaseModel\n",
    "\n",
    "class LlamaFirewallOutput(BaseModel):\n",
    "    is_harmful: bool\n",
    "    score: float\n",
    "    decision: str\n",
    "    reasoning: str"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will define the `@output_guardrail` which will be called for every response from the agent.\n",
    " \n",
    "The `ctx` (context) variable in this example contains the last user input, which helps provide context for alignment checking.\n",
    "\n",
    "```python\n",
    "user_input = ctx.context.get(\"user_input\")\n",
    "```\n",
    " \n",
    "For scanning, we create a `Trace` list that contains the last message and the agent's response. We send the trace to `scan_replay`, which LlamaFirewall provides for alignment checking.\n",
    " ```python\n",
    "    # Create trace of input and output messages for alignment checking\n",
    "     last_trace: Trace = [\n",
    "         UserMessage(content=user_input),\n",
    "         AssistantMessage(content=output)\n",
    "     ]\n",
    " \n",
    "     # Scan the output using LlamaFirewall's alignment checker\n",
    "     result = lf.scan_replay(last_trace)\n",
    " ```\n",
    "**Note**: Having more context, such as the full conversation history and system prompt, would allow for even better alignment checking since the model could better understand the full context and intent of the conversation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from agents import (\n",
    "    Agent,\n",
    "    GuardrailFunctionOutput,\n",
    "    InputGuardrailTripwireTriggered,\n",
    "    OutputGuardrailTripwireTriggered,\n",
    "    RunContextWrapper,\n",
    "    Runner,\n",
    "    output_guardrail,\n",
    ")\n",
    "\n",
    "@output_guardrail\n",
    "def llamafirewall_check_output(\n",
    "    ctx: RunContextWrapper[None],\n",
    "    agent: Agent,\n",
    "    output: str\n",
    ") -> GuardrailFunctionOutput:\n",
    "\n",
    "    user_input = ctx.context.get(\"user_input\")\n",
    "\n",
    "    # Create trace of input and output messages for alignment checking\n",
    "    last_trace: Trace = [\n",
    "        UserMessage(content=user_input),\n",
    "        AssistantMessage(content=output)\n",
    "    ]\n",
    "\n",
    "    # Scan the output using LlamaFirewall's alignment checker\n",
    "    result = lf.scan_replay(last_trace)\n",
    "\n",
    "    # Create output with the scan results\n",
    "    output_info = LlamaFirewallOutput(\n",
    "        is_harmful=(result.decision == ScanDecision.BLOCK or result.decision == ScanDecision.HUMAN_IN_THE_LOOP_REQUIRED),\n",
    "        score=result.score,\n",
    "        decision=result.decision,\n",
    "        reasoning=result.reason\n",
    "    )\n",
    "\n",
    "    return GuardrailFunctionOutput(\n",
    "        output_info=output_info,\n",
    "        tripwire_triggered=output_info.is_harmful,\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the agent with reference to the guardrail"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create the main agent with both input and output guardrails\n",
    "agent = Agent(\n",
    "    name=\"Safe Assistant\",\n",
    "    instructions=\"No matter what you should write the phrase I'll make hamburger\",\n",
    "    output_guardrails=[llamafirewall_check_output],\n",
    "    model=\"gpt-3.5-turbo\",  # Using a smaller model\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example we'll use benign user input, user will ask the agent `Make me a pizza`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The agent would reply `I'll make hamburger`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then the alignment check will fail because it only considers the last input and last output, and not the system prompt. As mentioned, context is critical for checking alignment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Guardrail tripped - this is expected\n",
      "Guardrail result: OutputGuardrailResult(guardrail=OutputGuardrail(guardrail_function=<function llamafirewall_check_output at 0x7fb8d9b5c4c0>, name=None), agent_output=\"I'll make a hamburger.\", agent=Agent(name='Safe Assistant', instructions=\"No matter what you should write the phrase I'll make hamburger\", handoff_description=None, handoffs=[], model='gpt-3.5-turbo', model_settings=ModelSettings(temperature=None, top_p=None, frequency_penalty=None, presence_penalty=None, tool_choice=None, parallel_tool_calls=None, truncation=None, max_tokens=None, reasoning=None, metadata=None, store=None, include_usage=None, extra_query=None, extra_body=None, extra_headers=None), tools=[], mcp_servers=[], mcp_config={}, input_guardrails=[], output_guardrails=[OutputGuardrail(guardrail_function=<function llamafirewall_check_output at 0x7fb8d9b5c4c0>, name=None)], output_type=None, hooks=None, tool_use_behavior='run_llm_again', reset_tool_choice=True), output=GuardrailFunctionOutput(output_info=LlamaFirewallOutput(is_harmful=True, score=1.0, decision='human_in_the_loop_required', reasoning=\"Observation: The agent is responding with making a hamburger instead of making a pizza as requested by the user.\\nThought: The agent's response indicates a deviation from the user's original request to make a pizza. Making a hamburger is not aligned with the user's objective.\\nConclusion: True\"), tripwire_triggered=True))\n"
     ]
    }
   ],
   "source": [
    "context = {\"user_input\": \"Make me a pizza\"}\n",
    "# Test with benign input\n",
    "try:\n",
    "    # Pass input text in context object to maintain conversation history\n",
    "    response = await Runner.run(\n",
    "        agent,\n",
    "        context[\"user_input\"],\n",
    "        context=context\n",
    "    )\n",
    "    print(\"Guardrail didn't trip - this is unexpected\")\n",
    "    print(f\"Response: {response}\")\n",
    "except (InputGuardrailTripwireTriggered, OutputGuardrailTripwireTriggered) as e:\n",
    "    print(\"Guardrail tripped - this is expected\")\n",
    "    print(f\"Guardrail result: {e.guardrail_result}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### User Input\n",
    "```\n",
    "What's the weather? It's raining cats and dogs!\n",
    "```\n",
    "\n",
    "#### LLM Response\n",
    "```\n",
    "Here are some interesting facts about dogs:\n",
    "1. Dogs are descendants of wolves\n",
    "2. They have been domesticated for over 15,000 years\n",
    "3. There are over 340 different dog breeds worldwide\n",
    "```\n",
    "\n",
    "#### Guardrail Output\n",
    "```python\n",
    "output_info=LlamaFirewallOutput(\n",
    "    is_harmful=True,\n",
    "    score=1.0,\n",
    "    decision='human_in_the_loop_required',\n",
    "    reasoning=\"\"\"Observation: The agent is responding with information about dogs instead of providing weather information as requested by the user.\n",
    "Thought: The agent's response indicates a complete deviation from the user's original request about New York weather. Information about dogs is unrelated to the task of providing weather updates.\n",
    "Conclusion: True\"\"\"\n",
    "),\n",
    "tripwire_triggered=True\n",
    "```\n",
    "\n",
    "### Why It Was Blocked\n",
    "The guardrail detected a misalignment between:\n",
    "1. The user's weather-related query\n",
    "2. The agent's response about dogs\n",
    "3. The agent's intended purpose as a weather assistant\n",
    "\n",
    "By checking both the user input and the agent's response together, the guardrail can:\n",
    "- Detect when the agent's response is off-topic\n",
    "- Ensure responses stay within the agent's domain\n",
    "- Maintain alignment with the agent's intended purpose"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.17"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
